Chronicle Specials

Integration of bioinformatics resources: Critical need today
Our Bureau, Mumbai | Thursday, July 31, 2003, 08:00 Hrs  [IST]

As the biotechnology industry worldwide grows in both size and pace, the resource pool that supplies its vast data has become critical. At the same time, as the industry struggles to deal with increasingly large volumes of complex and often interrelated data, the need for integration in the management of bioinformatics data is more acute than ever. To maximize the usefulness and reusability of a data source, a code of conduct for data providers has been formulated, and its principles are gaining widespread support globally.

Although the basic elements now appear to be in place for truly integrated bioinformatics, from a technological point of view it is the World Wide Web that seems to offer the infrastructure needed to support the integration and accessibility of the data that researchers require. Using this as the basic architecture, serious attempts are under way in many countries to build reliable and stable networks.

In India, the Biotechnology Information System Network (BTISNET), recently initiated by the Department of Biotechnology (DBT) and linking 11 prestigious R&D institutions in the country, is one such attempt. This virtual private network is intended to enhance bioinformatics activities and to support nationwide efforts to develop human resources in bioinformatics in a big way.

The DBT effort should be of much help to R&D-driven biotechnology companies pursuing collaborative research with the major distributed information centres that are to be linked by the network. According to sources involved in the project, the institutions that will share bioinformatics-related information over this intranet will have access to more than 100 databases, serving over 10,000 users in all parts of the country. The network can also help distribute the various study materials needed for bioinformatics.

Bioinformatics, which has a key role to play in all genomics, proteomics and sequencing-related activities, deals with the whole gamut of biological data. It covers the development of data-analysis tools and the modelling of biological macromolecules and other complexes, and has applications in metabolic pathways as well as in the design of new drug molecules, peptide vaccines, proteins and more.

In this context, a recent proposal by world experts in bioinformatics for a Code of Conduct for biological data providers would allow easier integration of bioinformatics resources. The code provides a solid framework within which both providers and consumers of data can exchange data and develop beneficial relationships within the bioinformatics community.

The six tenets of this code of conduct can be summarized as follows: reuse existing code and make use of open-source resources; use existing data formats rather than reinventing the wheel; design simple and sensible new data formats, avoiding proprietary binary data types; treat the interface between an application and a data source as a formal agreement between the data provider and the data consumer; encourage choice for data consumers when designing interfaces; and support ad hoc queries.

At present, several technologies are available to advance the goals of this code of conduct. Database federations and data warehouses have traditionally been used to integrate disparate data sources, while the advent of web services offers new possibilities beyond what the traditional methods can deliver. A database federation has a global (federation) schema that provides users with a uniform view of all databases in the federation and thus insulates them from the component databases. For example, if a user runs a query through a federation comprising 10 databases, the results are received as if from a single database, in a common format, rather than as 10 different sets of results.
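The federation idea can be sketched in a few lines: one query is fanned out to every member database, and the merged rows come back in a single common format. This is a minimal illustration using two in-memory SQLite databases; the table and column names are invented for the example, not drawn from any real federation.

```python
import sqlite3

def make_db(rows):
    # A component database with its own local copy of the data.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE genes (id TEXT, symbol TEXT)")
    db.executemany("INSERT INTO genes VALUES (?, ?)", rows)
    return db

db_a = make_db([("ENSG01", "TP53")])
db_b = make_db([("ENSG02", "BRCA1")])

def federated_query(symbol_prefix):
    """Run one query against every member database and return the
    merged rows in the common (global-schema) format."""
    results = []
    for db in (db_a, db_b):
        cur = db.execute(
            "SELECT id, symbol FROM genes WHERE symbol LIKE ?",
            (symbol_prefix + "%",))
        results.extend(cur.fetchall())
    return results

# One call, one result set - the caller never sees the two members.
print(federated_query(""))
```

The caller issues a single query and is insulated from how many component databases sit behind it, which is exactly the uniform-view property described above.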

A data warehouse represents the materialization of a global schema, in that it is loaded periodically with data from the component databases, organizing these disparate sources into a single repository, with or without a common schema. Some of the more established examples include the Genomics Unified Schema (GUS), a data warehouse that attempts to predict protein function based on protein domains, and EnsEMBL, a collaborative project of the European Molecular Biology Laboratory (EMBL, Heidelberg, Germany), the European Bioinformatics Institute (EBI, Cambridge, United Kingdom) and the Sanger Centre (Cambridge, United Kingdom) that automatically tracks sequenced fragments of the human genome and assembles them into longer stretches.
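The contrast with a federation is that a warehouse physically copies the data on a schedule rather than querying the sources live. The following sketch shows one refresh cycle under a common schema; the sources, schema and refresh policy are all hypothetical simplifications.

```python
import sqlite3

# Hypothetical component sources, already mapped to a common
# (id, symbol, origin) record layout by an upstream transform step.
source_a = [("ENSG01", "TP53", "source_a")]
source_b = [("ENSG02", "BRCA1", "source_b")]

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE genes (id TEXT, symbol TEXT, origin TEXT)")

def load():
    """One periodic load: wipe and re-materialize the warehouse
    from every component source."""
    warehouse.execute("DELETE FROM genes")
    warehouse.executemany("INSERT INTO genes VALUES (?, ?, ?)",
                          source_a + source_b)
    warehouse.commit()

load()
rows = warehouse.execute(
    "SELECT id, symbol FROM genes ORDER BY id").fetchall()
print(rows)
```

Queries then run against the materialized copy, which is fast and always available but only as fresh as the last load - the trade-off that distinguishes warehouses from federations.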

Web services are intended to enable the exchange of data among heterogeneous systems in human-readable, platform-neutral XML messages. The web services architecture represents an attempt to allow remote access to data and application logic in a loosely coupled fashion. Previous attempts to achieve this, such as Distributed COM (DCOM) and Java Remote Method Invocation (RMI), required tight integration between client and server and used platform- and implementation-specific binary data formats.
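What "human-readable, platform-neutral XML" means in practice can be shown with a round trip: any system that can serialize and parse XML can produce or consume the message, regardless of platform. The element names below are invented for illustration and do not belong to any real service's schema.

```python
import xml.etree.ElementTree as ET

def build_request(accession):
    # Serialize a request as plain XML text - readable by a human
    # and parseable on any platform.
    root = ET.Element("getSequenceRequest")
    ET.SubElement(root, "accession").text = accession
    return ET.tostring(root, encoding="unicode")

def parse_request(xml_text):
    # The receiving side needs only an XML parser, not a shared
    # binary format or a matching runtime (unlike DCOM or RMI).
    root = ET.fromstring(xml_text)
    return root.findtext("accession")

msg = build_request("U00096")
print(msg)                  # the XML message itself is the wire format
print(parse_request(msg))   # -> U00096
```

Because the contract is just the XML document, client and server can be written in different languages and replaced independently - the loose coupling the article describes.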

Bioinformatics can be defined as the application of techniques from computer science to solve problems in molecular biology. It is a relatively young field, and the pace of research is driven by the large and rapidly increasing amount of data being produced by, for example, efforts to sequence the genomes of a variety of organisms. The areas where computer science can be applied range from the assembly of sequence fragments and the analysis of DNA, RNA and protein sequences to the prediction and analysis of protein sequence and function, and the analysis and simulation of general metabolic function and regulation.

In the words of Dr Hwa A. Lim (HAL), chairperson and CEO of D'Trends, Inc, bioinformatics is certainly not mere number-crunching for molecular biologists; it is the application of computer-science techniques such as modelling, simulation, data abstraction, data manipulation and pattern discovery to the analysis of biological data. The data generated by experimental scientists require annotation and detailed analysis in order to be turned into knowledge, which can then be applied to improving health care - via, for example, new drugs and gene therapy - as well as medical practice and food production, all of which are now high-profile issues nationally.
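A tiny example of the kind of sequence analysis mentioned above: two textbook DNA computations, GC content and reverse complement. These are standard definitions chosen for illustration, not the method of any particular tool named in the article.

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def gc_content(seq):
    """Fraction of bases that are G or C, a basic sequence statistic."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """Complement each base, then reverse, giving the opposite strand
    read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))

print(gc_content("ATGC"))          # 0.5
print(reverse_complement("ATGC"))  # GCAT
```

Real analyses scale these ideas to millions of bases and to pattern discovery across whole genomes, which is where the data abstraction and manipulation techniques Dr Lim describes come in.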

- (Content Courtesy: D'Trends, Inc and PharmaGenomics)
